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Abstract 

Background: With the progress of modern sequencing technologies a large number of complete genomes are now 
available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases 
where these techniques cannot be applied, for example if two genomes do not share the same set of genes, or if they 
are not alignable to each other due to low sequence similarity, rearrangements and inversions, or more specifically to 
their lengths when the organisms belong to different species. For these cases the comparison of complete genomes 
can be carried out only with ad hoc methods that are usually called alignment-free methods. 

Methods: In this paper we propose a distance function based on subword compositions called Underlying 
Approach (UA). We prove that the matching statistics, a popular concept in the field of string algorithms able to 
capture the statistics of common words between two sequences, can be derived from a small set of "independent" 
subwords, namely the irredundant common subwords. We define a distance-like measure based on these subwords, 
such that each region of genomes contributes only once, thus avoiding to count shared subwords a multiple number 
of times. In a nutshell, this filter discards subwords occurring in regions covered by other more significant subwords. 

Results: The Underlying Approach (UA) builds a scoring function based on this set of patterns, called underlying. We 
prove that this set is by construction linear in the size of input, without overlaps, and can be efficiently constructed. 
Results show the validity of our method in the reconstruction of phylogenetic trees, where the Underlying Approach 
outperforms the current state of the art methods. Moreover, we show that the accuracy of UA is achieved with a very 
small number of subwords, which in some cases carry meaningful biological information. 

Availability: http://www.dei.unipd.it/~ciompin/main/underlying.html 
Keywords: Phylogeny, Alignment-free algorithms, Pattern discovery 



Background 

The global spread of low-cost high-throughput sequenc- 
ing technologies has made a large number of complete 
genomes publicly available, and this number is still grow- 
ing rapidly. In contrast, only few computational methods 
can really handle as input entire chromosomes, or entire 
genomes. 

Traditionally the comparison of related genomes is car- 
ried out by sequence alignment. Popular methods extract 
gene-specific sequences from all species under exam- 
ination and build a multiple sequence alignment for 
each gene [1]. Then all multiple sequence alignments 
are merged to form the final phylogeny. Other methods 
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[2] use genes as a dictionary, counting the presence or 
absence of a gene. This gene profile is then used to derive 
a similarity score. However, if the genomes in question do 
not share a common set of genes, or if they cannot be 
aligned to each other, e.g., due to substantially difi^erent 
lengths, these traditional techniques cannot be applied. As 
a general example, in a pairwise comparison of genomes 
popular alignment tools rely on a specific order of ele- 
ments for each genome sequence, and on a set of sparse 
shared seeds that should then be extended to obtain a 
global alignment. Therefore low sequence similarity, rear- 
rangements, and inversions can cause major problems 
in identifying a possible alignment and thus the actual 
sequence similarity. 

Furthermore, when considering whole genomes, the 
global alignment of large sequences has become a 
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prohibitive task even for supercomputers, hence sim- 
ply infeasible. To overcome these obstacles, in the last 
ten years a variety of alignment-free methods have been 
proposed. In principle they are all based on counting 
procedures that characterize a sequence based on its con- 
stituents, e.g., A:-mers [3,4]. 

An important aspect in phylogeny reconstruction is the 
fact that gene-based methods strictly focus on compar- 
ing the coding regions of genomes, which can account 
for as little as 1% of the genomic sequence in humans 
[5]. Whereas it is known that the use of whole genomes 
may provide more robust information when comparing 
different organisms [6]. Also most alignment- free meth- 
ods in the literature use only a portion of complete 
genomes [7]. For instance, there are approaches that use 
only genie regions [3] or mitochondria; other methods 
filter out regions that are highly repetitive or with low 
complexity [4]. Recently, it has been shown that the evo- 
lutionary information is also carried by non-genic regions 
[8]. For certain viruses, we are not even able to esti- 
mate a complete phylogeny by analyzing their genes, 
since these organisms may share a very limited genetic 
material [7]. 

Sims et al. recently applied the Feature Frequency Pro- 
files method (FFP) presented in [4] to compute a whole- 
genome phylogeny of mammals [8] — i.e., large eukaryotic 
genomes including the human genome — and of bacteria. 
This method needs to estimate the parameter k in order 
to compute a feature vector for each sequence, where the 
vector represents the frequency of each possible ^-mer. 
Each feature vector is then normalized by the total num- 
ber of ^-mers found (i.e., by the sequence length minus 
k-1), obtaining a probability distribution vector, or feature 
frequency profile, for each genome. FFP finally computes 
the distance matrix between all pairs of genomes by apply- 
ing the Jensen-Shannon [9] divergence to their frequency 
profiles. 

This general characterization of strings based on their 
subsequence composition closely resembles some of the 
information theory problems, and is tightly related with 
the compression of strings. In fact, compositional meth- 
ods can be viewed as the reinterpretation of data com- 
pression methods, well known in the literature [10,11]. 
For a comprehensive survey on the importance and impli- 
cations of data compression methods in computational 
biology, we refer the reader to [12]. 

When comparing entire genomes we want to avoid that 
large non-coding regions, which by nature tend to be 
highly repetitive, may contribute to our scoring function 
a multiple number of times, thus misleading the final sim- 
ilarity score. In fact, while analyzing massive genomes, 
the number of repeated patterns is very high, particularly 
in the non-genic regions. Furthermore if we allow mis- 
matches the number of patterns can grows exponentially 



[13-15]. In this paper we will address this problem by con- 
trolling the distribution of subwords over the sequences 
under consideration, so that their contribution will not be 
overcounted. 

Moreover, when comparing genomes it is well known 
that different evolutionary mechanisms can take place. In 
this framework, two closely related species are expected 
to share larger portions of DNA than two distant ones, 
whereby also other large complements and reverse- 
complements, or inversions, may occur [16]. In this work 
we will take into account all these symmetries, in order to 
define a measure of similarity between whole genomes. 

Matching statistics 

Among the many distance measures proposed in the lit- 
erature, which in most cases are dealing with k-mets, 
an effective and particularly elegant method is the Aver- 
age Common Subword approach (ACS), introduced by 
Ulitsky et al. [7]. They use a popular concept in the field 
of string algorithms, known as matching statistics [17]. In 
short, given two sequences s\ and S2, where si is the refer- 
ence sequence, the matching statistic is a vector I such that 
l[i] is the length of the longest subword starting at posi- 
tion i of 51 that is also a subword of S2, for every possible 
position i of si (see Table 1). 

A popular measure of similarity between strings is the 
average of this vector. In fact the general form of ACS is: 



We can notice the similarity with the cross entropy of 
two probability distributions P and Q: 

HiP,Q) = - Y,Pix)\ogqix), 

X 

where p{x) log q{x) measures the number of bits needed to 
code an event x from P if a different coding scheme based 
on Q is used, averaged over all possible events x. 

From the theoretical prospective it can be shown [7] that 
the ACS approach mimics the cross entropy estimated 
between two large sequences generated by a finite-state 
Markov process. In practice, this is closely related to the 
KuUback-Leibler information divergence, and represents 
the minimum number of bits needed to describe one 
string, given the other: Dkl{P II Q) = H{P, Q) - H{P). 

Table 1 Example of matching statistics /i [ i] and /2[y] for 



the strings Si = ACACGTAC and $2 = TACGTGTA 
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This is perhaps the most frequently used information- 
theoretic similarity measure. 

The advantage of using the matching statistics is that it 
is not based on fixed-length subwords, but it can capture 
also variable length matches, in contrast to most meth- 
ods that are based on fixed sets of Ar-mers. In fact, with 
the latter the choice of the parameter k is critical, and 
every method needs to estimate k from the data under 
examination, typically using empirical measurements [4]. 

For this reason ACS proved to be useful for reconstruct- 
ing whole-genome phylogenies of viruses, bacteria, and 
eukaryotes, outperforming in most cases the state-of-the- 
art methods [7]. 

Methods 

In this section we propose a distance measure between 
entire genomes based on the notion of underlying sub- 
words. In order to build a sound similarity measure 
between genomes, we need first to study the properties 
of the matching statistics. Our first contribution is the 
characterization of the subwords that are needed to com- 
pute the matching statistics. A second contribution is the 
selection of these subwords so that the resulting simi- 
larity measure does not contain overcounts. Our main 
idea is to avoid overlaps between selected subwords, more 
precisely by discarding common subwords occurring in 
regions covered by other more significant subwords. 

Irredundant common subwords 

In the literature, the values /[/] used by the ACS approach 
are called the matching statistics, as described in detail by 
Gusfield [17]. Our first contribution is to characterize the 
matching statistics in order to identify which subwords are 
essentials. 

It is well known that the total number of distinct sub- 
words of any length found in a sequence of length n can be 
at most ©(«^). Remarkably a notable family of fewer than 
2n subwords exist that is maximal in the host sequence, 
in the sense that it is impossible to extend a word in this 
class by appending one or more characters to it without 
losing some of its occurrences [18]. It has been shown 
that the matching statistics can be derived from this set of 
maximal subwords [19]. Here we will further tighten this 
bound by showing that to compute the matching statistics 
it is enough to consider a subset of the maximal subwords, 
called irredundant common subwords. 

The notion of irredundancy was introduced in [20] 
and later modified for the problem of protein com- 
parison [21,22]. It proved useful in different contexts 
from data compression [23] to the identification of tran- 
scription factors [24]. In this paper we introduce the 
concept of irredundant common subwords (i.e., without 
mismatches/wildcards). This ensures that there exists a 



close correspondence between the irredundant common 
subwords and the matching statistics. 

Definition 1. (Irredundant/Redundant common sub- 
word) A common subword w is irredundant if and only 
if at least an occurrence of w in s\ or S2 is not covered 
by other common subwords. A common subword that does 
not satisfy this condition is called a redundant common 
subword. 

We observe that the number of irredundant common 
subwords Isi^si is bounded hy m-\- n, where |si| = n and 
IS2I = m, since it is a subset of the set of maximal common 
subwords (see [19,25] for a more complete treatment of 
this topic). 

Proposition 1. The matching statistics 4i(0 can be 
computed by combining together all and only the irredun- 
dant common subwords ofs\ and S2. 

Proof. To show that the vector 4^ (i) can be derived from 
the irredundant common subwords, we define a new vec- 
tor of scores 1^, for each subword w, where /»[/] = |w| — 
/ ' + 1 represents the length of each suffix / of w, with 
j = 1, . . . ,\w\. Then, for each subword w mlsi,s2 we super- 
impose the vector /„, on all the occurrences of w in si. 
For each position i, in 5i, hiii) is the maximum value of 
the scores max„(lw[j] ), such that k -\- j = i and k is an 
occurrence of w. 

To complete the proof we have to show that every occur- 
rence of a common subword of s\ and S2 is covered by 
some occurrence of a subword in 2si,s2- definition of 
irredundant common subword, any occurrence of a sub- 
word corresponds to an irredundant common subwords 
or is covered by some subword in 2si,s2- Moreover every 
irredundant common subword w has at least an occur- 
rence i that is not covered by other subwords. Thus, 4i (0 
corresponds exactly to \ w\ and the subword w is necessary 
to compute the matching statistics. In conclusion, by using 
the method described above for 4^ (j), we can compute for 
each position the length of the maximum common sub- 
word starting in that location, which corresponds to the 
matching statistics. □ 

In summary, the notion of irredundant common sub- 
words is useful to decompose the information provided 
by the matching statistics into several patterns. Unfortu- 
nately these subwords can still overlap in some position. 
This might lead to an overcount in the matching statis- 
tics, in which the same region of the string contributes 
more than once. Our aim is to remove the possibility of 
overcount by filtering the most representative common 
subwords for each region of the sequences si and S2, which 
will also remove all overlaps. 
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Underlying subwords 

When comparing entire genomes we want to avoid that 
large non-coding regions, which by nature tend to be 
highly repetitive, may contribute to the similarity score a 
multiple number of times, thus misleading the final score. 
In fact, while analyzing massive genomes, the number of 
repeated patterns is very high, particularly in the non- 
genic regions. Therefore we need to filter out part of this 
information, and select the "best" common subword, by 
some measure, for each region of the sequences. 

In this regard, we must recall the definition of pattern 
priority and of underlying pattern, adapted from [26] to 
the case of pairwise sequence comparison. We will take as 
input the irredundant common subwords and the under- 
lying quorum u = 2, i.e. they must appears at least twice. 
Let now w and W be two distinct subwords. We say that 
w has priority over w' , or w — >• W , if and only if either 
\w\ > \w'\, or \w\ = \m/\ and the first occurrence of 
w appears before the first occurrence of w'. In this case, 
every subword can be defined just by its length and one of 
its starting positions in the sequences, meaning that any 
set of subwords is totally ordered with respect to the pri- 
ority rule. We say that an occurrence / of w is tied to an 
occurrence /' of a subword w', if the two occurrences over- 
lap, i.e. ([ /, / + I M/| - 1] n[ r, /' + |vi/| - 1] ) # 0, and m/ ^ 
w. Otherwise, we say that / is untied from 

Definition 2. (Underlying subword) A set of subwords 
^si,s2 — -^si.s2 '•^ ^^^^ underlying if and only if: 

(i) every subword w mUsi,s2> calledan underlying 
subword, has at least two occurrences, one in si and 
the other in S2, that are untied from all the untied 
occurrences of other subwords in Usi,s2 \ ^> ^'^d 

(ii) there does not exist a subword w e Isi^2 \ ^si,s2 such 
that w has at least two untied occurrences, one per 
sequence, from all the untied occurrences of 
subwords in Usi,s2- 

This subset of Isi^2 composed only by those sub- 
words that rank higher with our priority rule with respect 
to si. In fact, if there are overlaps between subwords that 
are in Isi,s2> we will select only the subwords with the 
highest priority. Similarly to the score ACS(si,S2), the set 
ltsi^2 is asymmetric and depends on the order of the two 
sequences; we will address this issue in Section "A dis- 
tance-like measure based on underlying subwords". As for 
the underlying patterns [26], one can show that the set of 
underlying subwords exists, and is unique. As a corollary 
we know that the untied occurrences of the underlying 
subwords can be mapped into the sequences si and S2 
without overlaps. Moreover, by definition, the total length 
of the untied occurrences cannot exceed the length of the 



sequences. These two properties are crucial when build- 
ing a similarity measure, because any similarity that is 
based on these subwords will count the contribution of a 
region of the sequence only once. 

Efficient computation of underlying subwords 

To summarize we select the irredundant common sub- 
words that best fit each region of si and S2, employing a 
technique that we call Underlying Approach or, in short, 
UA. This technique is based on a simple pipeline. We 
first select the irredundant common subwords and sub- 
sequently filter out the subwords that are not underlying. 
From a different perspective, we start from the smallest 
set of subwords that captures the matching statistics and 
remove the overlaps by applying our priority rule. In the 
following we show how to compute the irredundant com- 
mon subwords and the matching statistics, and then we 
present an approach for the selection of the underlying 
subwords among these subwords. The general structure of 
the Underlying Approach (UA) is the following: 

• 1) Compute the set of the irredundant common 
subwords lsi,s2 

• 2) Rank all subwords in Isi^2 according to the 
priority and initialize U to an empty set. 

• 3) Iteratively select a subwords p from 2si,s2 following 
the ranking. 

• 4a) If p has at least two untied occurrences: add p to 
U and update the corresponding regions of F (see 
next) in which p occurs; 

• 4b) otherwise, discard p and return to (3). 

Discovery of the irredundant common subwords 
In step (1) we construct the generalized suffix tree Ts^,s2 
of si and S2- We recall that an occurrence of a subword 
is (left)right-maximal if it cannot be covered from the 
(left)right by some other common subword. The first step 
consists in making a depth-first traversal of all nodes of 
Tsi,s2, and coloring each internal node with the colors of 
its leaves (each color corresponds to an input sequence). 
In this traversal, for each leaf i of Tsi,s2' we capture the 
lowest ancestor of i having both the colors ci and C2, say 
the node w. Then, iv is a common subword, and i is one of 
its right-maximal occurrences (in 5i or in 52); we select all 
subwords having at least one right-maximal occurrence. 
The resulting set will be linear in the size of the sequences, 
that is 0(m + n). This is only a superset of the irredun- 
dant common subwords, since the occurrences of these 
subwords could be not left-maximal. 

In a second phase, we map the length of each right- 
maximal occurrence i into /jj (i), and, using Proposition 1, 
we check which occurrences i have length greater than or 
equal to the length stored in the location i — 1 (for loca- 
tions i > 2). These occurrences are also left-maximal. 
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since they cannot be covered by a subword appearing at 
position i — 1. Finally we can retain all subwords that have 
at least an occurrence that is both right- and left-maximal, 
i.e, the set of irredundant common subwords Isi,s2- Note 
that, by employing the above technique, we are able to 
directly discover the irredundant common subwords and 
the matching statistics Is^ (i). 

The construction of the generalized suffix tree Tg^^g^ and 
the subsequent extraction of the irredundant common 
subwords Isi,s2 can be completed in time and space linear 
in the size of sequences. 

Selection of the underlying subwords 

In this section we describe, given the set of the irre- 
dundant common subwords 2si,s2) how to filter out the 
subwords that are not underlying, obtaining the set of 
underlying subwords Usi,s2- 

The extraction of underlying subwords takes as input 
the set 2si,s2 '"^^ the tree Tsi,s2 from the previous section. 
First we need to sort all subwords in Isi,s2 according to 
the priority rule (step 2). Then, starting from the top 
subword, we analyze iteratively all subwords by checking 
their untied occurrences (step 3). If the subword passes 
a validity test we select it as underlying (step 4a), oth- 
erwise we move on with the next subword (step 4b). 
The two key steps of this algorithm are: sorting the sub- 
words (step 2) and checking for their untied occurrences 
(step 4a). 

Step 2 is implemented as follows. For all subwords we 
retrieve their lengths and first occurrences in si from 
the tree Tg^^^- Then each subword is characterized by its 
length and the first occurrence. Since these are integers in 
the range [ 0, n] we can apply radix sort [27], first by length 
and then by occurrence. This step can be done in linear 
time. 

In order to implement step 4a we need to define the 
vector r of M booleans, representing the locations of si. 
If r[i] is TRUE, then the location / is covered by some 
untied occurrence. We also preprocess the input tree and 
add a link for all nodes v to the closest irredundant ances- 
tor, say prec{v). This can be done by traversing the tree 
in preorder. During the visit of a the node v if it is not 
irredundant we transmit to the children prec{v) otherwise 
if V is irredundant we transmit v. This preprocess can be 
implemented is linear time and space. 

For each subword w in Tg^^^ we consider the list £„, of 
occurrences to be checked. All Cw are initialized in the 
following way. Every leaf v, that represent a position i, 
send its value i to the location list of the closest irredun- 
dant ancestor using the link prec(v). Again this preprocess 
takes linear time and space since all positions appear in 
exactly one location list. We will updated these lists J(L^f 
only with the occurrences to be checked, i.e. that are not 
covered by some underlying subword already discovered. 



We start analyzing the top subword w and for this case C„, 
is composed by all the occurrences of w. 

For each occurrence i of w we need to check only its first 
and last location in the vector F; i.e., we need to check 
the locations r[i] and r[ j -|- |w| — 1]. If one of these two 
values is set to true, then i is tied by some subword m/. 
Otherwise, if both the values are set to FALSE, then i must 
be untied from all other subwords. Since all subwords 
already evaluated are not shorter than w, then they can- 
not cover some locations mr[i,i + \w\ — 1] without also 
covering r[i] or r[i + \w\ - 1]. Thus, if r[j] and r[« -f- 
\w\ — 1] are both set to false, we mark this occurrence 
i as untied for the subword w and update the vector T 
accordingly. 

If r[i] is TRUE we can completely discard the occur- 
rence /, for the subword w and also for all its prefixes, that 
are represented by the ancestors of w in the tree Tg-^^g^. 
Thus the occurrence i will no longer be evaluated for any 
other subword. 

If r[j] is FALSE and r[i + \w\ - 1] is TRUE, we need 
to further evaluate this occurrence for some ancestors of 
w. In this case, one can compute the longest prefix, w', of 
w such that r[i + \w'\ — 1] is set to FALSE and w' is an 
irredundant common subword. Then the occurrence i is 
inserted into the list C„/. 

This step is performed by first computing the length 
d < \w\ such that r[ / -I- - 1] is FALSE and r[i + d] 
is TRUE, and then retrieving the corresponding prefix w' 
of w in the tree that spells an irredundant common sub- 
word with length equal to or shorter than d. We can 
compute d by means of a length table / in support (or 
in place) of the boolean vector F. For each untied occur- 
rence i of w, X stores the values [ 1, 2, . . . , in the 
locations [i, i + 1, . . . , i + \w\ — 1], similarly to the proof 
of Proposition 1. Using this auxiliary table we can com- 
pute the value of d for the location under study i as 

d=\w\- xii + \w\-^]■ 
Now, to select m/, the longest prefix of w with |m/| < 
d, we employ an algorithm proposed by Kopelowitz and 
Lewenstein [28] for solving the weighted ancestor problem, 
where weights correspond to the length of words spelled 
in the path from the root to each node, in case of a suffix 
tree. In the weighted ancestor problem one preprocesses 
a weighted tree to support fast predecessor queries on the 
path from a query node to the root. That is, with a lin- 
ear preprocessing on a tree of height n, using the above 
algorithm it is possible to locate any ancestor node W 
that has a weight less than d in time O (log log k). In our 
case, the maximum length for an irredundant subword is 
min{m, «), thus we can find a suitable ancestor m/ of w in 
time 0(loglogmin{»?, «}), with 0{m+ri) preprocessing of 
the tree Tg^^^. 

At the end of the process, if the subword w has at least 
one untied occurrence per sequence, then we mark w as 
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underlying subword. Otherwise, all the occurrences of w 
that are not covered are sent to its ancestors, using the 
previous procedure. 

To analyze the overall complexity we need to compute 
how many times the same location i is evaluated. Suppose, 
for example, that i belongs to jC^ of the subword w. The 
location i is evaluated again for some w, and inserted into 
the list Cw, only if r[i] is FALSE and Ffj + |w| - 1] is 
TRUE. Note that the locations not already covered are in 
the range [ i, i+ \w\—d— 1], with d > 0. Then, the subword 
w is the longest prefix of w that is an irredundant common 
subword and that lives completely in the locations [ i, i + 
\w\ — d — 1]; however w may not cover the entire interval. 
Now, the occurrence i will be evaluated again only if there 
exists another subword W that overlaps with mi, and that 
has a higher priority with respect to w. The worst case is 
when w' ends exactly at position —d—\ and overlaps 
with w by only one location. Since W must be evaluated 
before w, then \w'\ > \w\. Thus the worst case is when the 
two subwords have about the same length. In this settings 
the length of the subword w can be at most {\w\—d)l1. We 
can iterate this argument at most O(log \w\) times for the 
same position i. Therefore any location can be evaluated at 
most 0(logmin{w, «)) times. In conclusion, our approach 
requires 0((m + «) logminjm, «) loglogminjw, «}) time 
and 0{m + n) space to discover the set of all underlying 
subwords ZYsi, 



Extension to inversions and complements 

In this section we discuss the extension of the algorithmic 
structure discussed above to accommodate also inversion 
and complement matches. 

A simple idea is to concatenate each sequence with 
its inverse and its complement, while keeping separate 
the occurrences coming from direct matches, inversions, 
and complements. In brief, we first define x as the con- 
catenation of a string x with its inverse, followed by its 
complement, in this exact order. Then, we compute the 
irredundant common subwords, Is\j2' '^he sequences 
Si and S2- We subsequently select the underlying sub- 
words by ranking all the irredundant common subwords 
in the set Isx,s2- Using the same algorithm described above 
we compute the set Us^^s2> then we map each sub- 
word occurrence to the reference sequences si. This will 
include also inversions and complements of S2 that are 
shared by s\. In this way, we can store all the untied occur- 
rences and consider all possible matches for each region 
of Sl. 

In this framework, we choose to take into account all 
these symmetries, and thus the experiments presented 
will use this extended approach. We will also measure 
the contribution of inversions and complements to our 
similarity measure. 



A distance-like measure based on underlying subwords 

In the following we report the basic steps of our distance- 
like measure. Let us assume that we have computed 

ZYsi,s2i and the other specular set Us2,si- For every subword 



w e Usi,s2 we sum up the score ^;Ji! — hw\w\(\w\ + 
l)/2 in UA(si,S2), where hw is the number of its untied 
occurrences in 5i, similarly to ACS [7]. Then, we average 
UA{si, $2) over the length of the first sequence, si, yielding 



UA{si,S2) = 



2n 



This is a similarity score that is large when two sequences 
are similar, therefore we take its inverse. 

Moreover, for a fixed sequence si this score can also 
grow with the length of S2, since the probability of having 
a match in s\ increases with the length of S2- For this rea- 
son, we consider the measure log4(|s2|)/lM(si, S2); we use 
a base-4 logarithm since DNA sequences have four bases. 
Another issue with the above formula is the fact that it is 
not equal to zero for s\ = S2; thus we subtract the cor- 
rection term log4(|si|)/tM(si,5i), which ensures that this 
condition is always satisfied. Since Usi^i contains only one 
subword, the sequence si itself, which trivially has only 
one untied occurrence in si, this yields to UA(si,si) = 
|sil(|siH-l)/(2|5i|) = I + l)/2. The following formulas 
accommodate all of these observations in a symmetrical 
distance-like measure duAisi,S2) between the sequences 
5i and 52= 



UA(si,S2) 



duA{S\,S2) 



l0g4(l^2l) _ 2l0g4(|Sl|) 
UA{Sx,S2) (|si|+l)' 



UA{si,S2) + UA{s2,si) 



We can easily see that the correction term rapidly con- 
verges to zero as |si | increases. Moreover, we notice that 
duA{s\,S2) grows as the two sequences Si and $2 diverge. 
From now we will simply refer to the measure duA {si, S2) 
as the Underlying Approach measure, or UA. 

Results 

Genome datasets and reference taxonomies 

We assess the effectiveness of the Underlying Approach 
on the estimation of whole-genome phylogenies of differ- 
ent organisms. We tested our distance function on three 
types of datasets: viruses, prokaryotes, and unicellular 
eukaryotes. 

In the first dataset we selected 54 virus isolates of the 
2009 human pandemic Influenza A - subtype HlNl, also 
called the "Swine Flu." The Influenza A virion has eight 
segments of viral RNA with different functions. These 
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Table 2 Benchmark for prokaryotes - Archaea & Bacteria 
domains 



Accession No. 


Domain 


Organism 


Size 


BA000002 


archaea 


aeropyrum pernix str, K1 


1 .7 Mbp 


AE000782 


archaea 


archaeoglobus fulgidus 
str. DSM 4304 


2.2 Mbp 


AE009439 


archaea 


methanopyrus kandleri 
str. AVI 9 


1 .7 Mbp 


AE010299 


archaea 


methanosarcina 

acetivorans str. C2A 


5.8 Mbp 


AE009441 


archaea 


pyrobaculum 
aerophilum str. IM2 


2.3 Mbp 


AL096836 


archaea 


pyrococcusabyssi 


1.8 Mbp 


AE009950 


archaea 


pyrococcus furiosus str. 
DSM 3638 


1 .9 Mbp 


AE000520 


archaea 


treponema pallidum sp. 
pall. str. Nichols 


1 .2 Mbp 


AE017225 


bacteria 


bacillus anthracis str. 
Sterne 


5.3 Mbp 


AL009126 


bacteria 


bacillus subtilis subsp. 
subtilisstr. 168 


4.3 Mbp 


AE013218 


bacteria 


buchnera aphidicola str. 
sg 


651 kbp 


AL111168 


bacteria 


Campylobacter jejuni sp. 
jej.str. NCTC 11168 


1 .7 Mbp 


AE002160 


bacteria 


chlamydia muridarum 

str. MoPn/Wiess-Nigg 


1.1 Mbp 


AM884176 


bacteria 


chlamydia trachomatis 
str. L2/434/BU 


1.1 Mbp 


AE016828 


bacteria 


coxiella burnetii str. RSA 
493 


2.0 Mbp 


AE017285 


bacteria 


desulfovibrio vulgaris sp. 
vulg. str. Hildenb. 


3.6 Mbp 


L42023 


bacteria 


haemophilus influenzae 
str. Rd KW20 


1.9 Mbp 


CP001037 


bacteria 


nostoc punctiforme str. 
PCC 73102 


8.4 Mbp 


Prokaryotic taxa used in our experiments, divided by domain. For each entity, 



we list the accession number in the NCBI genome database, the complete name 
and strain, and the genome size. 

RNAs are directly extracted from infected host cells, and 
synthesized into complementary DNA by reverse tran- 
scription reaction, where a specific gene amplification is 
performed for each segment [29]. We concatenate these 
segments according to their conventional order given by 
the literature [30]; this step, in general, does not affect the 
final phylogeny computed by our algorithm, and is used 
to sort subwords by location. The resulting sequences 
are very similar to each other, and have lengths in the 
order of 13,200 nucleotides each, accounting for a total of 
714,402 b. To compute a reference taxonomic tree, we per- 
form multiple sequence alignment using the ClustalW2 
[31] tooP as suggested by many scientific articles on the 
2009 Swine Flu [29,30]. Then, we compute the tree using 



Table 3 Plasmodium are parasites known as causative 
agents of malaria in different hosts and geographic regions 



Parasite 


Host 


Region 


Size 


P. berghei 


rodent 


Africa 


18.5 Mbp 


P. chabaudi 


rodent 


Africa 


18.8 Mbp 


P. falciparum 


human 


Africa, Asia & S./C. America 


23.3 Mbp 


P. knowlesi 


macaque 


Southeast Asia 


23.7 Mbp 


P. vivax 


human 


Africa, Asia & S./C. America 


22.6 Mbp 



The right-most column lists the size of each complete DNA genome. 



the Dnaml tool from the PHYLIP [32] software package,'' 
which implements the maximum likelihood method for 
aligned DNA sequences. In Dnaml we used the param- 
eters suggested in [29,30], which consider empirical base 
frequencies, constant rate variation among sites (with no 
weights), a transition ratio of 2.0, and best tree search 
based on proper searching heuristics. 

In the second dataset we selected 18 prokaryotic organ- 
isms among the species used in [7] for a DNA phyloge- 
nomic inference. We chose the species whose phyloge- 
nomic tree can be inferred by well-established methods 
in the literature (see Table 2). The organisms come from 
both the major prokaryotic domains: Archaea, 8 organ- 
isms in total, and Bacteria, 10 organisms in total. The 
sequences in question have lengths ranging from 0.6 Mbp 
to 8 Mbp, accounting for a total 48 Mbp. We compute 
their tree-of-life by using genes that code for the 16S 
RNA, the main RNA molecule inside the small riboso- 
mal subunit characterizing prokaryotes and widely used 
to reconstruct their phylogeny; the considered sequences 
are called 16S rDNA. We can extract a multiple align- 
ment of 16S rDNA sequences of the selected organisms 
from the Ribosomal Database Project [33] ;'^ our exper- 
iments are based on the release 8.1. Next, we perform 
a maximum likelihood estimation on the aligned set of 
sequences, employing ZJwawj/ from PHYLIP with standard 
parameters, in order to compute a reference tree based on 
the resulting estimation. 

In the third dataset we selected five eukaryotic organ- 
isms of the protozoan genus Plasmodium whose genomes 



Table 4 Comparison of whole-genome phylogeny 
reconstructions 



Species 


Group 


UA 


ACS 


FFP 


FFPfiy 


Influenza A 


Viruses 


80/102 


84/1 02 


100/102 


96/1 02 


Archaea 


Prokaryotes 


4/10 


4/10 


6/10 


6/10 


Bacteria 


Prokaryotes 


6/14 


10/14 


6/14 


10/14 


Arch. & Bact. 


Prokaryotes 


20/30 


22/30 


20/30 


22/30 


Plasmodium 


Eukaryotes 


0/4 


0/4 


4/4 


0/4 



Normalized Robinson-Foulds scores with the corresponding reference tree. For 
each dataset the best results are in bold. 
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have been completely sequenced (Table 3). Plasmod- 
ium are unicellular eukaryotic parasites best known as 
the etiological agents of malaria infectious disease. The 
sequences have lengths ranging from 18 Mbp to 24 Mbp, 
accounting for a total 106 Mbp. We used as reference 
tree the taxonomy computed by Martinsen et al. [34], as 
suggested by the Tree of Life Project. 

Whole-genome phylogeny reconstruction 

We exploited the above datasets to compare our method, 
the Underlying Approach (UA), with other efficient state- 
of-the-art approaches in the whole-genome phylogeny 
reconstruction challenge: ACS [7], FFP [4]"^ and FFPry- 
The FFPijy method, in contrast to FFP, employs the 
Purine-Pyrimidine reduced alphabet (RY) which is com- 
posed by two character classes: [A,G] (both purine 



bases, denoted by R) [C, T] (both pyrimidines, denoted 
by Y). We implemented the ACS method by ourselves, 
while for FFP and FFP^jy we used the FFP package release 
3.14 available online. 

We reconstruct the phylogenomic trees from the dis- 
tance matrices using the Neighbor-joining method as 
implemented in the PHYLIP package. We compare the 
resulting topologies with the respective reference trees 
using the symmetric difference of Robinson and Foulds 
(R-F) and the triplet distance. For two unrooted binary 
trees with « > 3 leaves, the R-F score is in the range 
[0, 2« — 6]. A score equal to 0 means that the two trees are 
isomorphic, while 2« — 6 means that all non-trivial bipar- 
titions are different. The R-F difference between two or 
more trees can be computed using the TreeDist tool from 
the PHYLIP package. 



. A/Bagota/0466N/2009H1N1 -2009/06/25 



. A/Italy/05/2009H1N1-2009/05/03 



. A/Mexico/4108/2009H1N1 -2009/04/02 



. A/California/06/2009H1N1-2009/04/16 



- A/Canada-NS/RV1565/2009H1N1-2009/04/30 

A/SantoDomingo/WR 1 06aN/2009H1N1 -2009/06/25 



. A/Paris/2580/2009H1 N1 -2009/04/30 



. A/Mexico/lnDRE13495/2009H1N1-2009/04/29 



A/Paris/2590/2009H1 N1 -2009/04/30 

A/SantoDomingo/572N/2009H1 N1 -2009/05/24 

. A/Toronto/T5308/2009H1 N1 -2009/06/03 



. A/England/1 95/2009H1 N1 -2009/04/28 
_ A/Paris/2592/2009H1N1 -2009/05/01 

A/Toronto/0462/2009H1N1 -2009/05/26 



. A/Beiiing/01/2009H1N1-2009/05/15 



. A/Shanghai/71T/2009H1N1 -2009/05/31 

A/Beijing/3/2009H1N1 -2009/05/23 

I A/California/04/2009H1N1 -2009/04/01 

I A/Fukuoka-C/1/2009H1 N1 -2009/06/07 

A/Callfornla/07/2009H1N1 -2009/04/09 



- A/Hiroshima/216/2009H1N1 -2009/06/30 

A/ltaly/127/2009H1N1 -2009/06/1 7 

A/NewYork/4197/2009H1N1 -2009/06/1 7 

A/Thalland/CU-B5/2009H1N1-2009/06/13 



. A/Texas/04/2009H1N1 -2009/04/1 4 
A/Texas/09/2009H1N1 -2009/04/25 



. A/Texas/08/2009H1 N1 -2009/04/24 



A/NewYork/06/2009H1N1 -2009/04/25 

A/Denmark/524/2009H1N1-2009/06/04 

A/Fukushima/1/2009H1N1 -2009/06/23 

A/NewYork/4870/2009H1N1 -2009/09/10 

I A/Shanghai/1/2009H1N1 -2009/05/23 

I A/Thailand/CU-H9/2009H1N1-2009/06/17 

OW/03/2009H1N1 -2009/05/26 

A/Taiwan/T0826/2009H1N1 -2009/07/10 

A/Yakohama/1/2009H1N1 -2009/06/09 



. A/Canada-0C/RV1 954/2009H1 N1 -2009/05/1 7 
A/Denmark/528/2009H1N1 -2009/06/09 



. A/Korea/01/2009H1N1-2009/05/02 

A/SanSalvador/0169T/2009H1N1 -2009/06/1 2 



A/auangdong/02/2009H1N1 -2009/05/27 

. A/lta!y/85/2009H1N1-2009/06/14 

A/Japan/PR1070/2009H1N1 -2009/07/1 0 



. A/NewYork/3166/2009H1N1 -2009/04/26 
A/Califarnia/14/2009H1N1 -2009/04/25 



. A/NewYork/3354/2009H1N1 -2009/05/08 

. A/ltaly/49/2009H1 N1 -2009/05/27 




A/MexicaCity/WR1312N/2009H1N1-2009/09/10 



. A/NewYork/4777/2009H1 N1 -2009/08/1 4 



A/Utah/05/2009H1N1 -2009/06/1 4 
A/MOSCOW/IIV05/2009H1 N1 -2009/06/20 
n/T0724/2009H1N1 -2009/05/1 9 



'ada-ON/RV1527/2009H1N1 -2009/04/24 

A/Hiroshima/200/2009H1N1 -2009/06/1 3 



Figure 1 Whole-genome phylogeny of the 2009 world pandemic Influenza A (H1N1) generated by UA. In green and red are represent the 
two main clades, where the green Mexico/4108 is probably the closest isolate to the origin of the influenza. In blue and orange are two of the 
possible early evolutions of the viral disease. The organisms which do not fall into one of the two main clades according to the literature are in black. 
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We ran FFP and FFP^jy for different values of k (the 
fixed subword length) as suggested by [4], retaining the 
best results in agreement with the reference trees. Table 4 
compares our method with the other state-of-the-art 
approaches, by showing the R-F difference with respect to 
the reference taxonomy tree. 

Our method, UA, achieves good performance in every 
test considering the R-F difference with the reference tax- 
onomy tree, and very good performance if we further 
analyze the resulting phylogenies, as in Figures 1, 2, and 
3. For every dataset the best results are shown in bold. 
We can observe that UA is constantly the best performing 
method, and that this advantage becomes more evident 
for large dataset, where sequences share large parts, such 
as the Influenza A (HlNl) viruses. 



The Robinson and Foulds distance is a standard method 
to evaluate topological discordance between trees. How- 
ever when dealing with large trees it is known that small 
variations can generate very large R-F scores (typically, 
already for n ~ 10). For this reason we conducted a 
second series of experiments using the triplet distance 
[35]. The triplet distance is a more refined measure that 
does not suffer this problem. Moreover, to better compare 
all taxonomies, we report the triplet distance between 
all trees. Tables 5, 6 and 7 show the triplet distance 
between all trees for all datasets. This more refined mea- 
sure confirms the applicability of UA with respect to FFP 
and ACS. 

In more detail. Figure 1 shows that our approach can 
distinguish the two main clades of the 2009 Influenza 



I Pyrococcus abyssi - archaea 

I Pyrococcus furiosus - archaea 

I Treponema pallidum subsp. pallidum - archaea 

I I Archaeoglobus fulgidus - archaea 

I I Pyrobaculum aerophilum - archaea 

I Methanopyrus kandlerl - archaea 

I Desulfovibrio vulgaris subsp. vulgaris - bacteria 

I Aeropyrum per nix - archaea 

I Chlamydia muridarum - bacteria 

I Chlamydia trachomatis - bacteria 

I Methanosarcina acetivorans - archaea 

I Bacillus anthracis - bacteria 

I Bacillus subtilis subsp. subtilis - bacteria 

I Buchnera aphidicola - bacteria 

I Campylobacter jejuni subsp. jejuni - bacteria 

I Haemophilus influenzae - bacteria 

I Coxiella burnetii - bacteria 

'■lostoc punctiforme - bacteria 

Figure 2 Whole-genome phylogeny of prokaryotes by UA. In red are the branches of the /\rcf)oea domain, while in green are those of the 
Bacteria domain. Clusters of other organisms are highlighted with different colors. Only two organisms do not fall into the correct clade: 
Methanosarcina acetivorans - archaea (in cyan) and Desulfovibrio vulgaris subspecies vulgaris - bacteria (in black). 
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0.2266 



P. falciparum 



0.3848 



P. berghei 



0.3902 



P. chabaudi 



0.1723 



P. knowlesi 



P. vivax 



Figure 3 Whole-genome phylogeny of the genus Plasmodium by UA, with our whole-genome distance highlighted on the branches. 



A-HINI (in green and red), which have been outlined in 
[30]. The origin of the flu could reside in the Mexican iso- 
late (Mexico/4108, in green), from which all other green 
isolates may have ensued. Two sub-clades for the U.S. 
states of California and Texas are highlighted within the 
red clade, most probably corresponding to the first major 
evolutions of the viral disease. 

Similar results are obtained for the second dataset, as 
shown in Figure 2. UA can easily distinguish the Archaea 
domain, in red, from the Bacteria domain, in green, and 
also other sub-clades with respect to the reference tree 
(these sub-clades are highlighted in the figure with differ- 
ent colors). The organisms in black do not form a clade 



with other organisms in the reference tree. For the third 
dataset (Figure 3), the whole-genome phylogeny of the 
genus Plasmodium generated by UA corresponds exactly 
to the taxonomy found in the literature. 

The accuracy results are promising, but we believe that 
of equal interest are the patterns used for the classifi- 
cation. Our approach, by construction, uses only a very 
small number of patterns. For this reason we report 
in Table 8 some statistics for the underlying subwords 
selected, averaged over all experiments. We can notice 
that the number of irredundant patterns is in general 
smaller than the length of the genomes, and this is a first 



Table 5 Comparison of whole-genome phylogeny of 
influenza virus 



Viruses 


Reference 


UA 


ACS 


FFP 


FFPfiy 


Reference 


0.0 


0.60 


0.63 


0.86 


0.88 


UA 


0.60 


0.0 


0.30 


0.81 


0.74 


ACS 


0.63 


0.30 


0.0 


0.83 


0.81 


FFP 


0.86 


0.81 


0.83 


0.0 


0.73 


FFPfiy 


0.88 


0.74 


0.81 


0.73 


0.0 



Table 6 Comparison of whole-genome phylogeny of 
prokaryotes 



Prokaryotes 


Reference 


UA 


ACS 


FFP 


FFPrv 


Reference 


0.0 


0.24 


0.37 


0.62 


0.39 


UA 


0.24 


0.0 


0.37 


0.55 


0.47 


ACS 


0.37 


0.37 


0.0 


0.59 


0.48 


FFP 


0.62 


0.55 


0.59 


0.0 


0.57 


FFP«y 


0.39 


047 


0.48 


0.57 


0.0 



Normalized triplet distance between all trees. The best results are in bold. 



Comparison of Whole-Genome Phylogeny of Prokaryotes. Normalized triplet 
distance between all trees. The best results are in bold. 
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Table 7 Comparison of whole-genome phylogeny of 
Plasmodium 



Plasmodium 


Reference 


UA 


ACS 


FFP 


FFPrv 


Reference 


0.0 


0.0 


0.0 


0.4 


0.0 


UA 


0.0 


0.0 


0.0 


0.3 


0.0 


ACS 


0.0 


0.0 


0.0 


0.3 


0.0 


FFP 


0,4 


0.3 


0.0 


0.0 


0.3 


FFPftv 


0.0 


0.0 


0.0 


0.3 


0.0 



Normalized triplet distance between all trees. The best results are in bold. 



form of information filtering. Moreover we can observe 
that only a few underlying subwords are selected on 
average among the irredundant common subwords. This 
number is always very small when compared with all pos- 
sible irredundant subwords, and much smaller than the 
length of the sequences. 

Similar considerations can be drawn for the underlying 
subwords length. On average they can be very long, espe- 
cially with respect to FFP that uses only k-mers with k 
in the range [5,10]. Furthermore, each underlying sub- 
word occurs only a few times per sequence, and in general 
about one occurrence per sequence. Removing the high- 
frequency subwords, we can notice that the underlying 
subwords typically have length > log4 min{w, «), and in 
the case of viruses they can be very large, capturing more 
information than FFP. The longest underlying subwords 
appear in the virus dataset, and they are on the order 
of a thousand bases. We checked if these subwords may 
have some biological meaning and we found that in some 
cases they correspond to whole viral segments that are 
shared between two genomes. This confirms that, in some 
cases, the underlying subwords used for classification can 
capture some biological insight. 

Another interesting aspect is the contribution of inver- 
sions and complements in our similarity measure, with 
respect to the classical notion of match. We compute 
the average number of occurrences used in our scoring 
function that is due to inversions and complements. The 
contribution of inversions and complements is about 28- 
33% and 19-20%, respectively. This fact may be due to the 
nature of the sequences considered, but we believe that 
this topic deserves more attention. 

Conclusion 

In conclusion, we have shown that the underlying sub- 
words can be used for the reconstruction of phyloge- 
netic trees. Preliminary experiments have shown very 
good performance in the identification of major clus- 
ters for viruses, prokaryotes, and unicellular eukaryotes. 
An important observation that distinguishes our meth- 
ods from the others is that only a small number of 



Table 8 Main statistics for the underlying approach 



averaged over all experiments 


Counting 


Influenza A 


Arch. & Bact. 


Plasmodium 


Min genome size 


12,976 b 


650 kbp 


18,524 kbp 


Max genome size 


13,511 b 


8,350 l<bp 


23,730 kbp 


Average genome size 


13,230 b 


2,700 kbp 


21,380 kbp 


Irredundants 1X5,^2! 


3,722 


3,167 k 


16,354 k 


Underlying subwords 


60 


112k 


706 k 










Min \w\ in Wsi,s2 


6 


10 


12 


Max |w| in Ws,,s2 


1,615 


25 


266 


Average \w\ in W5,,5j 


264 


14 


20 


Untied inversions 


28% 


31% 


33% 


Untied complements 


22% 


20% 


19% 



underlying subwords is used in our distance, nevertheless 
the results are promising. From this fact we can spec- 
ulate that only a very limited number of subwords is 
needed to establish the phylogeny of genomic sequences. 
Thus, an interesting problem that can be addressed using 
the underlying subwords is the selection of probes for 
DNA chips. 

In the future, we plan to extend this method for the 
comparison of whole genomes based on short reads com- 
ing from next-generation sequencing, instead of using 
assembled genomes. 

Endnotes 

^ClustalW2 is available at http://www.ebi.ac.uk/Tools/ 
msa/clustalw2. 

'^PHYLIP (phylogenetic inference package) is a free com- 
putational phylogenetics software package available at 
http://evolution.genetics.washington.edu/phylip. 
'^The Ribosomal Database Project is available at http:// 
rdp.cme.msu.edu. 

'^The FFP software package release 3.14 is available at 
http:/ /fFp- phylogeny.sourceforge.net. 
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